ABSTRACT

A new approach, micro benchmarks, has recently been developed.
Using this technique, we have analyzed the KSR1, and in particular
the "ALLCACHE" memory architecture and ring interconnection. We
have been able to elucidate many facets of memory performance. The
technique has enabled us to identify and characterize parts of the
memory design not described by Kendall Square Research. Our results
show that a miss in the local cache can incur a penalty ranging
from 7.5 microseconds to 500 microseconds (when a dirty "page" in
the local cache must be evicted). The programmer must be very
careful in placement and accessing of data to obtain maximum
performance from the KSR1; the data presented here will help in
understanding the performance actually obtained.

1.  Introduction 

The KSR1 from Kendall Square Research is a novel parallel computer.
It is the first commercial machine embodying a scalable all-cache
form of shared memory architecture. In addition, the machine has a
number of other interesting features.

We report our observations of the KSR1, obtained by means of a
suite of small benchmarks that expose the details of the machine
characteristics. We refer to these small benchmarks as micro
benchmarks. In section 2 we briefly describe the micro benchmark
approach and its application to parallel machines. The micro
benchmark suite has been developed and used to analyze the
performance of uniprocessor machines [12, 13]. We describe the
architecture of the KSR1 as we understand it in section 3. We have
run our standard micro benchmark suite for processor performance,
and included the results together with comparative results from two
other CPUs of interest (section 4). The main focus of our work has
been to understand and measure the performance of the KSR1's novel
"ALLCACHE" memory, which we have extensively analyzed with a new
set of micro benchmarks. This work and the results are described in
section 5. Section 6 analyzes a set of experiments used to measure
the effect of contention in the interconnection network.

2.  The  Micro  Benchmark  Approach 

Recently, one of us (Saavedra) has explored a new approach to
benchmark analysis of computers. This approach has been documented
in several papers [12, 13, 14]. The approach was developed in
reaction to the use of large applications as benchmarks. Though it
is hoped that large applications will be more representative of
real workloads than synthetic benchmarks or small kernels, it is
not clear what features of a particular system they exercise, or
what actually accounts for the differences in the performance of
these benchmarks on different machines.

The micro benchmark approach returns to the idea of measuring
specific features of the machine. But in contrast to measuring only
a few parameters, such as floating point multiply, the approach
consists of (1) measuring every observable feature of the machine,
and (2) making use of the collected set of data in an integrated
way. For uniprocessors, one of the most powerful ways of using the
micro benchmarks is to predict the performance of a program on a
new computer without first porting the program and measuring the
results. This is done by analyzing the program to determine how
much use is made of each of the machine features measured by the
micro benchmarks, and then using the results of the execution of
only the micro benchmark suite on the new computer to predict the
running time of the program. Using the approach, it has proven
possible to estimate accurately the performance of a wide range of
programs, including standard benchmark suites such as Spec and
Perfect [16, 3]. Further, the analysis of these programs in terms
of the features measured by the micro benchmark suite gives insight
into the reasons for the observed performance differences for the
programs on different computers.
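
The prediction step described above can be illustrated with a
minimal sketch. It assumes the simplest possible model, a purely
linear sum of operation counts multiplied by measured per-operation
times; the actual methodology is the one documented in [12, 13]. The
per-operation times are four KSR1 entries from Table 1, while the
function name and the operation counts are invented for
illustration.

```python
# Per-operation times in nanoseconds: four KSR1 entries from Table 1.
KSR1_TIMES_NS = {
    "integer_add": 44.0,
    "fp_add": 52.5,
    "proc_call": 757.2,
    "loop_overhead": 101.6,
}


def predict_time_ns(op_counts, op_times_ns):
    """Linear model: sum over operations of (count * measured time)."""
    missing = set(op_counts) - set(op_times_ns)
    if missing:
        raise KeyError(f"no measurement for: {sorted(missing)}")
    return sum(count * op_times_ns[op] for op, count in op_counts.items())


# Invented dynamic operation counts for an imaginary program.
example_counts = {
    "integer_add": 1000,
    "fp_add": 2000,
    "proc_call": 10,
    "loop_overhead": 1000,
}

predicted = predict_time_ns(example_counts, KSR1_TIMES_NS)
print(f"predicted time: {predicted / 1000:.1f} microseconds")
```

Swapping in the measurements of another machine changes the
prediction without touching the program's operation counts, which is
what lets the method estimate performance before porting.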

Important factors in machine performance are cache, memory, and
network interconnect behavior. A test has been developed that
reveals a great deal about the memory hierarchy behavior (including
the network), together with a way of displaying the data from this
test using a set of diagrams that we have named Physical and
Performance Profiles (or P³ diagrams). We explain this test, and
the main characteristics of the P³ diagrams, below. The data thus
obtained, together with information about the rate of misses in a
program, is factored into the other micro benchmark results as part
of the prediction methodology.

This paper reports results of the micro benchmark analysis of the
KSR1. The machine contains many features (described below) not
found in other machines. To understand the performance implications
of these features, it has been necessary to develop additional
micro benchmarks beyond the initial, general purpose suite. These
are described later.

An open and interesting question is the analysis of parallel
programs, and the prediction of their performance through the use
of micro benchmarks. This involves developing new tests for
synchronization, access to shared variables, and probably features
provided in run time libraries for parallel machines. However, it
is not clear at this point which factors are relevant to take into
account in building a reasonable model of program execution which
can produce predictions about the execution time of parallel
programs. Furthermore, we believe that a substantial experimental
understanding about the performance regimes exhibited by shared
memory machines is required before any such model can be developed.
We will report separately our results for message passing machines,
and for other shared memory machines such as the Stanford DASH [8],
for which we have also developed new micro benchmarks.

Figure 1: KSR1 cache and subcache organization.

As will be seen below, our approach reveals a great deal of
interesting information about the KSR1. There is much more work to
be done, however, to extend the approach into the area of
prediction.

3. Architecture of the KSR1

The KSR1 is a new architecture for parallel machines, and one that
attempts to solve the problem of scalability in shared memory
multicomputers. The problem is that as the number of processors
increases, the average cost of access to memory goes up. One
approach has been to design faster memory interconnects, together
with highly interleaved memory. This approach appears to be limited
to at most a few hundred processors.

An alternate approach, explored by both Kendall Square and the
Stanford DASH project, is to use a directory-based caching scheme
with a relatively large amount of memory close to each individual
processor. Both of these machines use a "message-passing"
interconnect mechanism rather than more typical memory interconnect
methods. In the case of the KSR1, it is a ring of rings [6], while
the Stanford DASH uses a 2 dimensional mesh interconnection [8].

The KSR1 is organized as a ring of rings, with up to thirty-two
processors connected to each lowest level ring, called ring:0.
Ring:0 rings are interconnected by another ring, ring:1. Each node
consists of a CPU, a 512-KByte subcache, 32 MBytes of cache memory,
a cache directory, a ring interface, and an I/O interface. The
directory supports the cache coherency protocol between processors
and between rings. A total of 1088 processors can be connected as
34 ring:0 rings interconnected by one ring:1 ring. Kendall Square
Research intends to extend the hierarchy so as to connect even more
processors in future designs.

The processing component has four functional units: integer,
floating point, control, and an I/O unit. Instructions are issued
in pairs: an integer or floating point instruction paired with a
control or I/O instruction. The machine is a load/store
architecture, with loads and stores issued by the control unit.
Some floating point instructions result in two floating point
operations. There are 32 integer registers, 64 floating point
registers, and 32 addressing registers. The integer and floating
point registers hold 64 bits; the addressing registers are 40 bits.
The machine operates at 20 megahertz and is fully pipelined, with
two branch delay slots.

The user view of memory (called the Context Address, CA) is a
segmented address space. Segments can range from 2^ to 2*® bytes
(in the current implementation) in length. The processor produces
40 bit addresses, interpreted as a segment number and offset. The
processor contains an instruction segment table, with 8 entries,
and a data segment table of 16 entries.

The addresses generated by the processor are translated into System
Virtual Address (SVA) space. The segment tables include the base
location of the segment in SVA, the segment's length, and access
permissions. Presumably, the segment tables are fully associative.
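
The translation just described can be sketched as follows. This is
only an illustration under our own assumptions: the entry layout,
the field names, the single "writable" flag standing in for access
permissions, and the example segment sizes and base addresses are
all hypothetical, and the linear scan merely models what a fully
associative lookup would compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentEntry:
    ca_base: int     # first context address covered (hypothetical field)
    length: int      # segment length in bytes
    sva_base: int    # base location of the segment in SVA space
    writable: bool   # simplified stand-in for the access permissions


def translate(ca, table, write=False):
    """Translate a 40-bit context address to an SVA address."""
    if not 0 <= ca < 2**40:
        raise ValueError("context addresses are 40 bits wide")
    for entry in table:  # models a fully associative match
        if entry.ca_base <= ca < entry.ca_base + entry.length:
            if write and not entry.writable:
                raise PermissionError("write to a read-only segment")
            return entry.sva_base + (ca - entry.ca_base)
    raise LookupError("address is not covered by any segment")


# Two made-up data segments; the real table holds up to 16 entries.
data_segments = [
    SegmentEntry(0x0000_0000, 0x40_0000, 0x10_0000_0000, writable=True),
    SegmentEntry(0x0040_0000, 0x40_0000, 0x20_0000_0000, writable=False),
]

print(hex(translate(0x1234, data_segments)))
```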

3.1. KSR1 ALLCACHE Memory Architecture

The memory in each processor is organized as a two-level cache
hierarchy. There is a large local cache that combines the functions
of memory and a second level cache, and a subcache that is the
first level cache. Both caches are managed by a directory structure
consisting of a large unit of allocation in the cache directory,
and a smaller unit of data transfer, as shown in Figure 1. This is
in contrast with more common cache organizations in which the units
of allocation and transfer are the same.

There are instruction and data subcaches, each 256K bytes. Each
subcache is managed by a subcache directory with 128 blocks each
containing 32 subblocks. The directory is organized as 64 two-way
associative sets with a random replacement policy. A reference to a
new subcache block incurs the overhead associated with invalidating
all the subblocks of the block being displaced. The subcache is
cache coherent with the local cache. The subblocks each contain 64
bytes of data. A read from the subcache is satisfied in 3 clock
cycles (load instructions have 2 delay slots).
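
As a consistency check on the figures above, the short sketch below
recomputes the subcache geometry from the quoted parameters. The
variable names are ours; the 20 MHz clock is the one quoted in
section 3.

```python
# Figures quoted in the text for one subcache (instruction or data).
SUBBLOCK_BYTES = 64
SUBBLOCKS_PER_BLOCK = 32
BLOCKS = 128
SETS = 64
WAYS = 2
CLOCK_MHZ = 20          # processor clock quoted in section 3
READ_HIT_CYCLES = 3

block_bytes = SUBBLOCK_BYTES * SUBBLOCKS_PER_BLOCK   # unit of allocation
subcache_bytes = BLOCKS * block_bytes
read_hit_ns = READ_HIT_CYCLES * 1000 / CLOCK_MHZ

# 64 two-way sets must account for all 128 directory blocks.
assert SETS * WAYS == BLOCKS
assert subcache_bytes == 256 * 1024

print(f"block size:    {block_bytes} bytes")
print(f"subcache size: {subcache_bytes // 1024} KB")
print(f"read hit time: {read_hit_ns:.0f} ns")
```

The unit of allocation (a 2 KB block) is thus 32 times larger than
the unit of transfer (a 64-byte subblock).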

There is a 32 megabyte local cache in each node. There is no memory
with a fixed address (hence the term "ALLCACHE"). Instead, memory
is managed through a directory structure that makes all the cache
memory in the machine visible and accessible to every processor. A
similar memory architecture has been described in [5], where the
term Cache Only Memory Architecture (COMA) is used.

The local cache is 16-way set associative, and is organized in 16K
byte "pages". Each page is divided into 128 "subpages" of 128
bytes. These subpages are the unit of memory coherence. The
directory contains an entry for each of the 2K pages that comprise
the memory. Each entry includes a tag and the state of each
subpage, of which the visible states are invalid, read-only,
exclusive, and atomic. Exclusive means that this is the only copy
in any processor's local cache. Atomic is exclusive and locked.
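
The local cache parameters above fit together as the following
sketch shows. The derived address split at the end is our own
inference, not something stated by Kendall Square Research: the
offset and set-index widths follow from the page size and set count,
and adding the 19-bit tag quoted in section 3.2 gives 40 bits, the
width of the addresses the processor produces.

```python
PAGE_BYTES = 16 * 1024
SUBPAGE_BYTES = 128
PAGES = 2 * 1024
WAYS = 16
TAG_BITS = 19   # quoted in section 3.2

subpages_per_page = PAGE_BYTES // SUBPAGE_BYTES
local_cache_bytes = PAGES * PAGE_BYTES
page_sets = PAGES // WAYS

# Inferred split of an address into page offset and set index.
offset_bits = PAGE_BYTES.bit_length() - 1
set_bits = page_sets.bit_length() - 1

print(f"subpages per page: {subpages_per_page}")
print(f"local cache size:  {local_cache_bytes // 2**20} MB")
print(f"page sets:         {page_sets}")
print(f"address bits:      {TAG_BITS + set_bits + offset_bits}")
```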

Neither the whole 32 MBytes nor the 16-way associativity is
accessible from a user program. The OS sets aside a significant
number of pages (close to 30% of the total memory) for its own use
and these cannot be displaced out of the cache. We discuss in
detail the effective associativity experienced by user programs in
section 5.

A consequence of this design is that if the processor references a
subpage from a new page, all the lines from another page present in
the cache may need to be evicted. In order to keep at least one
copy of all pages currently being referenced, each page is assigned
a "home" node. The home page provides space for every subpage, even
if there are no valid subpages at the home node. Having a home node
greatly simplifies evicting a subpage, as there is guaranteed to be
a node with room for the subpage. If the home node must evict a
page, the operating system will swap the page to disk or to another
node's local cache to make room for the new page. A significant
fraction of the node's memory is set aside by the operating system
to be used as home pages [2].

The instruction set includes instructions to prefetch and poststore
subpages. A prefetch allows the processor to read a subpage from
another node without having to stall while the request is serviced.
There can be up to four prefetches outstanding at any time. If this
limit is reached, then additional prefetches are either discarded
or the processor is forced to block until one of the four pending
prefetches completes. A field in the prefetch instruction
determines the action to follow. An additional field indicates the
state of the subpage that is to be read: exclusive or read-only. A
poststore instruction causes the local cache to broadcast a
read-only copy of a subpage. All nodes with the subpage's page
allocated in their cache will read it, provided the cache directory
is not busy.
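
The limit of four outstanding prefetches and the discard-or-block
choice can be modeled with the toy sketch below. The class, method,
and return-value names are hypothetical, and the sketch assumes that
the oldest pending prefetch is the one that completes first, which
the text does not state.

```python
class PrefetchQueue:
    """Toy model of the outstanding-prefetch limit described above."""

    MAX_OUTSTANDING = 4

    def __init__(self):
        self.pending = []

    def complete_oldest(self):
        """Retire the oldest pending prefetch (an assumed ordering)."""
        return self.pending.pop(0)

    def issue(self, subpage, block_if_full):
        """Issue a prefetch; block_if_full models the instruction field."""
        if len(self.pending) < self.MAX_OUTSTANDING:
            self.pending.append(subpage)
            return "issued"
        if not block_if_full:
            return "discarded"
        # The processor stalls here until a pending prefetch completes.
        self.complete_oldest()
        self.pending.append(subpage)
        return "issued_after_stall"


queue = PrefetchQueue()
results = [queue.issue(n, block_if_full=False) for n in range(5)]
results.append(queue.issue(5, block_if_full=True))
print(results)
```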

A multi-ring machine contains a ring interface in each ring:0 ring
that is a directory for the entire ring (that is, it contains an
entry for every page that is in any processor cache in the ring).

The data rate of ring:0 is 1 gigabyte/second. Ring:1 supports a
"fat" structure with multiple rings to provide 1.2, 2.4, or 4.8
Gbytes/s bandwidth. Increasing the number of subrings in a ring:1
structure reduces the total number of available nodes that can be
used for processing in ring:0.

The KSR1 provides sequential consistency, which implies that writes
to a subpage cannot complete until all other copies present in the
machine have been invalidated [7].

3.2. Paging on the KSR1

The KSR1 scheme of allocating space in the caches in large,
page-sized units and filling in units of cache lines (subpages)
appears to be a reasonable compromise. If we consider the amount of
storage required for cache directory information we see that the
current implementation will require only about 100KB. This is
calculated from having 2048 page entries, each of which includes a
19-bit tag and at least 3 state bits for each of the 128 subpages.
A simpler implementation that separately allocated each subpage
would require more than 700KB of directory storage.
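
The two storage estimates can be reproduced with the arithmetic
below. For the per-subpage alternative we assume that each subpage
entry carries the same 19-bit tag and 3 state bits; a real
subpage-granularity directory would need a wider tag, so this is a
lower bound, in line with the "more than 700KB" figure.

```python
PAGES = 2048
SUBPAGES_PER_PAGE = 128
TAG_BITS = 19
STATE_BITS = 3

# Paged scheme: one tag per page plus state bits for every subpage.
paged_bits = PAGES * (TAG_BITS + STATE_BITS * SUBPAGES_PER_PAGE)

# Per-subpage scheme (assumed): a tag and state bits for every subpage.
per_subpage_bits = PAGES * SUBPAGES_PER_PAGE * (TAG_BITS + STATE_BITS)

paged_bytes = paged_bits // 8
per_subpage_bytes = per_subpage_bits // 8

print(f"paged directory:       {paged_bytes / 1024:.2f} KB")
print(f"per-subpage directory: {per_subpage_bytes / 1024:.2f} KB")
```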

The cost of this method is that an entire page must sometimes be
evicted due to a subpage miss. There are $2^{19}$ pages in SVA that
index to the same set of page positions (see Figure 1), and the
cache is 16 way set associative at the page level. So sequential
references to a set of only 17 pages can (in the worst pathological
case) cause a page eviction at every reference.
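
The pathological case is easy to reproduce in a small simulation of
a single 16-way set. The sketch assumes LRU replacement, which is
our choice for illustration; the replacement policy of the local
cache is not given here.

```python
from collections import OrderedDict


def count_evictions(page_refs, ways=16):
    """Simulate one LRU-managed set and count page evictions."""
    resident = OrderedDict()
    evictions = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)       # mark as most recently used
            continue
        if len(resident) == ways:
            resident.popitem(last=False)     # evict least recently used
            evictions += 1
        resident[page] = True
    return evictions


passes = 10
refs_16 = [page for _ in range(passes) for page in range(16)]
refs_17 = [page for _ in range(passes) for page in range(17)]

print("16 pages:", count_evictions(refs_16), "evictions")
print("17 pages:", count_evictions(refs_17), "evictions")
```

With 16 pages the set never overflows. With 17 pages visited
cyclically, every reference after the first 16 evicts the page that
will be needed soonest.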

An advantage of the paged scheme is that it matches disk accesses
well. When accessing a disk, a system wants to transfer large
chunks, since the disk is slow. With local cache pages, the KSR1
can quickly clear a page's state to make room, especially since the
disk page is the same size. Without cache pages a system would have
to clear each block individually. This is likely to be done
serially, since the tags would be in sequential RAM locations.

The KSR1 implementation saves a significant amount of storage per
node. This also greatly affects the ring interface node, which must
duplicate the entire state information of all 32 nodes on a ring:0.
This is an important consideration when looking at larger
configurations of the KSR1, as the ring directory must not become a
bottleneck if the system is to be scalable. More important than the
amount of storage in the ring directory is the time and cost to
search it; it requires checking each of the 512 (32 caches * 16-way
associativity) entries which might have a copy of the referenced
page.

3.3.  Comments  on  COMA  and  NUMA 

The COMA organization of the KSR1 contrasts with a more traditional
directory-based NUMA (non-uniform memory access) machine such as
the Stanford DASH by treating all of its main memory as a cache.
NUMA machines treat a memory address as having a static, known
location. The distance between a processor and the various main
memory modules differs, leading to the non-uniform access
distances. Both architectures commonly use the cache block as the
coherency unit (i.e. sharing occurs on a block basis). Note that
the KSR1 and Stanford DASH both include caches below the level of
main memory to improve performance. The DASH has first and
second-level caches, while the KSR1 has its subcache.

One important difference in the architectures occurs when accessing
shared data. When a NUMA machine misses in its cache for
consistency reasons, it will initiate a transaction to a node which
is functioning as the "home" for this address. A COMA machine,
however, cannot direct its transaction to a specific location;
instead, it issues a request that will search the caches of the
system until it finds the requested data. Determining which
organization will be faster for a particular access depends upon
the relative locations of shared data and the details of the
coherency protocol.


Factor              KSR1          Alpha         DASH
Integer Add         44.0 ns       6.1 ns        33.9 ns
Integer Multiply    925 ns        98.1 ns       3605 ns
Integer Divide      4968.8 ns     198.9 ns      10233 ns
F-Point Add         52.5 ns       22.9 ns       88.2 ns
F-Point Multiply    22.1 ns       20.0 ns       1473 ns
F-Point Divide      1760.6 ns     171.9 ns      702.4 ns
Complex Arith.      3199.1 ns     182.7 ns      7803 ns
Intrinsic Func.     7%9.6 ns      11343 ns      2683.0 ns
Logical Ops.        1485 ns       273 ns        106.7 ns
Branch/Switch       148.2 ns      15.0 ns       36.4 ns
Proc. Calls         757.2 ns      62.0 ns       379.6 ns
Array Indexing      473 ns        39.7 ns       1114 ns
Loop Overhead       101.6 ns      195 ns        105.6 ns

Table 1: Single CPU Performance of the KSR1, DEC Alpha 4000/610,
and DASH.

4. Summary of KSR1 CPU Micro Benchmark Measurements

As mentioned above, a micro benchmark suite that measures CPU
performance has previously been developed and used to obtain
measurements on a variety of uniprocessor machines. Results for
many systems have been reported in [12, 13]. We have run the same
suite on the KSR1 CPU.

The CPU micro benchmarks are machine-independent, so instead of
measuring machine instructions they measure operations defined in a
high level abstract machine. The abstract machine is based on the
Fortran programming language, so applications written in Fortran
compile directly into the abstract machine code. The number and
type of operations is directly related to the kind of language
constructs present in Fortran. Most of these are associated with
arithmetic operations and trigonometric functions. In addition,
there are parameters for procedure call, array index calculation,
logical operations, branches, and do loops.

In Table 1 we present the results we obtained, together with the
results of running the same micro benchmarks on the Stanford DASH
and a DEC Alpha 4000 model 610 system running at 160 MHz. The
processor in the Stanford DASH is the MIPS R3000 running at 33 MHz.
In future work, we plan to use these benchmarks and provide a
comparison of the KSR1 with a number of other parallel machines.

5. Analysis of the KSR1 Memory Architecture

As discussed above, a general methodology for analyzing the memory
behavior of machines with caches has been described in [13]. We
have extended this methodology with additional benchmarks that
measure the memory hierarchy behavior of shared memory
multiprocessors. Here, we give a brief explanation of the approach,
and present the results we have obtained for the KSR1.

5.1.  Methodology 

There are many specific measurements one can make of a memory
system. In general, there may be several levels of cache in
addition to the main memory in the system. The main memory may be a
single global module or distributed among the nodes. If the memory
is distributed, the processors may treat each module as local
memory, or may share the memory of all modules in a global, shared
address space. The properties of the memory interconnect, including
bandwidth and latency under a variety of loads, are of interest.
There may also be a write buffer associated with each level of
cache. There may be separate cache coherency directories as well as
the cache itself. It is a challenge to simply measure the
performance of all of these mechanisms in a way that shows what
happens under a variety of conditions. It is clear that a few
simple numbers are far from sufficient to characterize memory
behavior.

Beyond the issue of obtaining measurements that characterize the
behavior of the memory architecture under a full range of
conditions, there is the problem of presenting the results in a
more meaningful form than a large table of measurements, or
reducing the results to a few average numbers. Most useful would be
a presentation of the results that allows a programmer with a
specific application to understand what the memory performance of
his program would be. We have developed a method of displaying the
results that captures a significant amount of information in
graphical form. We call these diagrams Physical and Performance
Profiles (P³ diagrams) of the memory subsystem, as they contain the
physical characteristics of each memory structure in addition to
the performance characteristics.

5.2. The Structure of the Physical and Performance Profiles

Because of the complexity of the memory architectures of interest,
the P³ diagrams require some effort at interpretation, but they
have a great advantage as compared with a set of tables of results,
and provide far more information than averages which summarize the
measurements. The P³ diagrams are a set of plots representing the
average execution time needed to read, modify and write (an R-M-W
cycle) a single element in a sequence of locations (not necessarily
contiguous) taken from a region of memory, as a function of the
size of the region (R) and the distance (stride S) between
consecutive elements. An alternative experiment consists of reading
elements without changing their values. We refer to this type of
experiment as a read-use cycle (R-U cycle).

The access times are measured by timing the execution of a Fortran
loop. Each data point on a curve is the mean time per iteration
calculated from performing a fixed number of accesses to an array
of the given size, using that stride. The clock resolution of the
machine is 20 µsec. By factoring out loop overhead and averaging
over a large number of iterations, we believe that the error in our
results is generally less than a clock cycle.
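
The structure of such a measurement loop is sketched below. This is
not the Fortran kernel used for the measurements: it is a Python
rendering with hypothetical names, and interpreter overhead in
Python will swamp any cache effect. It only illustrates the three
ingredients named above: strided R-M-W accesses over a region, a
fixed number of accesses per data point, and subtraction of the
measured loop overhead.

```python
import time


def rmw_time_per_access(region_elems, stride_elems, accesses=200_000):
    """Mean seconds per read-modify-write, with loop overhead removed."""
    data = [0.0] * region_elems
    touched = range(0, region_elems, stride_elems)
    passes = max(1, accesses // len(touched))

    start = time.perf_counter()
    for _ in range(passes):
        for i in touched:
            data[i] = data[i] + 1.0          # read, modify, write
    with_access = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(passes):
        for i in touched:
            pass                              # same loops, no memory access
    overhead = time.perf_counter() - start

    return (with_access - overhead) / (passes * len(touched))


per_access = rmw_time_per_access(region_elems=1 << 16, stride_elems=8)
print(f"{per_access * 1e9:.1f} ns per R-M-W access")
```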

Depending on the relative magnitudes of R and S in an R-M-W
experiment with respect to the size, width, and associativity of
the structures forming the memory hierarchy, a distinctive value
for the average execution time is obtained. All results are
depicted as a set of curves, where each curve corresponds to a
particular value of R, with all values of $S = 2^n$ from 1 to $R/2$
plotted. In this section we briefly explain how to read these
diagrams. A more extensive discussion can be found in [14].

For explanatory purposes, the discussion will focus on the effects
of our experiments on a memory hierarchy consisting of a single
cache. The explanation extends trivially to more complex
hierarchies and in what follows we provide some comments in this
respect. Depending on the values of R and S with respect to the
size of the cache C, the line (block) size b and associativity a,
we can observe one of four basic regimes. Furthermore, the response
of a more complex memory hierarchy is just the superposition of the
memory structures' individual responses, which always fit one of
the four basic regimes.

In the presentation we assume that all variables take values that
are powers of two. However, if one of the physical dimensions of a
memory structure happens not to be a power of two, it will be
necessary to use different sequences of values for R and S.


[Figure 2]


Each micro benchmark consists of making multiple passes over an
array of size R, accessing every S-th element. The first pass (at
stride 1) over region R will incur some cold misses, but this error
is negligible due to the length of the micro benchmark. Micro
benchmarks at larger strides will incur no cold misses, as they
touch a smaller set of elements.

The simplest regime (regime 1) occurs when $R \le C$. Here,
independently of the stride, all elements accessed by the
experiments fit in the cache, so there are no cache misses in the
steady-state phase of the experiment. Therefore, the average time
of the R-M-W cycle as a function of S is a constant line. Curve
128K in Figure 2 is a clear example of regime 1*.

* Figure 2 represents the superposition of the effects of two
memory structures: the subcache subblock and block organizations.
All four regimes, however, are clearly identifiable in the figure
and we make reference to it to illustrate the regimes.

When $R > C$, misses start to occur, and depending on S we can
observe one of three regimes (2.a, 2.b, 2.c). Regime 2.a occurs
when $S < b$. Here, there are several consecutive accesses to each
cache line in between corresponding misses, so the cache miss
penalty is amortized amongst the accesses. As S grows, the average
time for the R-M-W cycle increases in proportion to S. All curves
in Figure 2 where $R \ge 512\mathrm{K}$ and $S < 64$ correspond to
regime 2.a. Regime 2.b represents the situation where each
reference falls into a different cache line and always generates a
miss. Formally, this is true only if the cache replacement policy
is either FIFO or LRU. For a random replacement policy, the effect
rapidly converges to that of LRU and FIFO as the number of lines
mapping to a set increases above the degree of associativity.
Regime 2.b occurs when $b \le S < R/a$. Here, each experiment
touches a subset of all cache sets, but the number of cache lines
mapping to a set is greater than the associativity. This result
follows from the following argument. There are $C/(ab)$ sets in a
cache. In general, an experiment touches $R/(b\lceil S/b\rceil)$
cache lines which are mapped into $C/(ab\lceil S/b\rceil)$ sets if
$S \le C/a$ or into a single set if $S > C/a$. In regime 2.b,
$S \ge b$, so $S/b$ is always a whole number greater than or equal
to one. Therefore, the number of lines touched is $R/S$ and these
are mapped into either $C/(aS)$ sets or a single one. In these two
cases, each set receives $Ra/C$ or $R/S$ lines, respectively, and
it follows from the conditions $R > C$ and $S < R/a$ that
$Ra/C > a$ and $R/S > a$.
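
The counting argument can be checked numerically with the small
sketch below. The function name is ours, and all parameters are
assumed to be powers of two, as in the presentation. The example
uses the block-level view of the subcache (C = 256K, b = 2K, a = 2)
with R = 512K and S = 4K.

```python
def lines_and_sets(R, S, C, b, a):
    """Return (cache lines touched, sets they map into)."""
    step = b * -(-S // b)            # b * ceil(S / b), in bytes
    lines = R // step
    sets = C // (a * step) if S <= C // a else 1
    return lines, sets


K = 1024
R, S, C, b, a = 512 * K, 4 * K, 256 * K, 2 * K, 2

lines, sets = lines_and_sets(R, S, C, b, a)
lines_per_set = lines // sets

print(f"lines touched: {lines}, sets used: {sets}")
print(f"lines per set: {lines_per_set} (associativity {a})")
```

Here each set receives Ra/C = 4 lines, which exceeds the
associativity of two, so every access misses under LRU or FIFO
replacement.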

Therefore, in regime 2.b, the average time for the R-M-W cycle as a
function of S is constant, assuming there are no other effects
produced by the other memory structures. In Figure 2, regime 2.b
corresponds to the two plateaus present in all curves in the
regions $R \ge 512\mathrm{K}$ and $64 \le S < 256$, and
$R \ge 512\mathrm{K}$ and $2\mathrm{K} \le S \le R/4$. The last
regime (2.c) occurs when the number of different cache lines
mapping into the same set is less than or equal to the
set-associativity. This situation is characterized by the condition
$R/a \le S < R$. For this regime, the R-M-W cycle average time
drops to the level of regime 1. Furthermore, the ratio $R/S$ at
which the drop occurs gives the set-associativity of the memory
structure. In Figure 2 all curves where $R \ge 512\mathrm{K}$
exhibit this behavior at their rightmost point, indicating that the
set-associativity is two.
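
The four basic regimes can be summarized as a small classifier. The
sketch below is ours and treats the memory structure as a single
cache with power-of-two parameters; the example models the subcache
at subblock granularity (C = 256K, b = 64, a = 2), which is a
simplification of its two-level block and subblock organization.

```python
def regime(R, S, C, b, a):
    """Classify an experiment (region R, stride S) for one cache."""
    if R <= C:
        return "1"       # everything fits: no steady-state misses
    if S < b:
        return "2.a"     # miss cost amortized over b/S accesses
    if S < R // a:
        return "2.b"     # every access misses
    return "2.c"         # at most a lines per set: hits again


def power_of_two_strides(R, smallest=8):
    """Strides from `smallest` bytes up to R/2, doubling each time."""
    stride = smallest
    while stride <= R // 2:
        yield stride
        stride *= 2


K = 1024
C, b, a = 256 * K, 64, 2
for R in (128 * K, 512 * K):
    labels = [regime(R, S, C, b, a) for S in power_of_two_strides(R)]
    print(f"R = {R // K}K:", " ".join(labels))
```

For R = 512K only the rightmost stride (S = R/2) falls in regime
2.c, which is how the diagrams reveal an associativity of two.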

In the next section, we discuss some specific performance
characteristics of the KSR1 that are observable from the diagrams
our experiments generate. In the diagrams we identify the KSR1
regimes by using roman numerals instead of the basic regime
numbers. We do this to avoid confusion, because most of the KSR1
regimes represent the superposition of several basic regimes
affecting different memory structures in the memory hierarchy.

5.3. KSR1 Single Ring, Single Node Performance Results

In Figure 3 we show the performance of a single KSR1 node while
reading data from the cache. The figure consists of curves for
regions of size R = 128K and larger, with strides of 8 bytes,
16 bytes, ..., R/2 bytes. The KSR1 word size is 64 bits, so 8 bytes
is the smallest stride.

When the data set being accessed is smaller than the size of the
subcache (regime 1) there will be no cache misses and we can read the
base time per iteration. The flat curve for the 128KB data set in
Figure 3 shows this case, and we see that the average time per
iteration of the loop for all strides used was about 650
nanoseconds. This is the time to perform one iteration of the loop
with a floating point add and multiply.

The size of the largest such curve with no misses tells us the size
of the subcache. In this case we see that the subcache is 256KB.
The 256KB curve is not completely flat because of interference from
other data used by the process: the data set is the same size as the
cache, so any accesses to other data will cause cache misses.

The 512KB and larger curves show us what happens in regimes
2.a, 2.b, and 2.c. The data is initially not in the subcache and the
first reference to a subblock will cause a cache miss; succeeding
references to the same subblock will hit. At stride 8, there will be
one miss and 7 hits. At stride 16, there will be one miss and only 3
hits to each subblock. As we increase the stride, we decrease the
number of hits and the cost of the miss is amortized over fewer
accesses. At a stride of 64 bytes, the curve flattens out, as every
reference is made to a different subblock. This indicates the
transition from regime 2.a to 2.b and we are able to conclude that the
subblock size for the subcache is 64 bytes.

We can also read the time taken to satisfy a miss to a subblock by
measuring the difference in times between the case with a miss on
every access (a data set of 512KB and stride of 64) and the case
with no misses (a data set of 128KB). We see on the KSR1 that
this is approximately 1300 nanoseconds.
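
The amortization of the subblock miss over the reads that share a subblock can be illustrated with a small calculation. The sketch below is a simplified model, not the authors' benchmark: it assumes a region larger than the subcache, strides of at most 64 bytes, and it ignores the additional block miss cost discussed next. The function name `avg_read_time_ns` is hypothetical; the constants are the values read from Figure 3 above.

```python
SUBBLOCK_BYTES = 64   # subcache subblock size inferred from Figure 3
BASE_NS = 650         # time per iteration with no misses
MISS_NS = 1300        # approximate subblock miss penalty


def avg_read_time_ns(stride):
    """Estimated mean time per read and number of hits per subblock."""
    accesses = max(1, SUBBLOCK_BYTES // stride)   # reads that land in one subblock
    hits = accesses - 1                           # all but the first read hit
    return BASE_NS + MISS_NS / accesses, hits


for stride in (8, 16, 32, 64):
    print(stride, avg_read_time_ns(stride))
```

Under these assumptions stride 8 gives 7 hits per miss and stride 16 gives 3, and the estimated time per read rises until stride 64, where every read pays the full penalty.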

Between stride 64 and 2048 the curves repeat the same pattern of
rising access times. This regime shows the effect of accessing a
new block in the subcache. There is a second major inflection in
the curve at a stride of 2048 bytes; this corresponds to the case in
which every reference is to a new block of the subcache. From
these curves, we are able to deduce that there is a directory
structure with blocks and subblocks which manages the subcache
(which we call the subcache directory; Kendall Square Research
has not published any information about this aspect of the
architecture).

At a stride of 256K bytes, the 512K curve shows that the cost
of a read is the same as the cost when there are no subcache
misses. In contrast, if the stride is 128K bytes, the cost of a read
includes the cost of a subcache page miss. From this, we can
conclude that the subcache is two-way set associative because at a
stride of 256K bytes only 2 different subblocks are being accessed
and they map to a single set in the cache.
[Figure 4 (plot not reproduced). Panel title: KSR1 2 nodes physical
and performance profile with replacement, R-M-W cycle. Horizontal
axis: stride (bytes). Annotations: cache page size 16 Kbytes; cache
page fault approximately 500 µs; cache size 32 Mbytes (not all
accessible to program); subcache block size 2048 bytes; subcache
subblock size 64 bytes; subcache 2-way associative; cache 8-way
associative.]

Figure 4: Physical and performance profiles of the KSR1 subcache and cache structures obtained using 2 competing nodes running
R-M-W based experiments. Each regime identifies the mean execution time for a particular combination of subcache and cache
miss penalties. For array sizes smaller than 16MB, there is no contention between the nodes.


Figure 2 shows the case of a loop that reads a word, modifies it,
and writes the result back to memory. Compared with the read
case shown in Figure 3, there are extra costs because the blocks of
the subcache have been modified. This increases the subcache
subblock miss penalty by about 560 nanoseconds (approximately 11
clock cycles), and the cost of a subcache block miss increases by
about 300 nanoseconds (6 clock cycles). We note that the base
case is 650 nanoseconds, as it was for the read case. From this we
conclude that the write is completely overlapped with the loop
branch and other operations.

5.4. Subcache Random Replacement Policy

In Figure 3 we can also observe the fact that the subcache
replacement policy is random. This manifests itself in the height of
curve 512K, which reaches a lower height than the other curves in
regime III. Regime III corresponds to basic regime 2.b. This basic
regime assumes either an LRU or FIFO replacement discipline to
enforce that every reference generates a miss (subcache subblock
and block misses in this case). With random replacement,
however, some subset of the references will not cause misses.

As mentioned in section 5.2, in a cache of size C and associativity
a, an experiment covering a region R will map $Ra/C > a$ cache
blocks into the same set. Now, in a random discipline, the
probability that a cache block will remain in the cache after a pass
through all elements in the experiment is

$$p_{hit} = \left(1 - \frac{1}{a}\right)^{Ra/C - 1}. \qquad (1)$$

Consequently, the average execution time per R-M-W cycle in
regime III ($T_{III}$) should be:

$$T_{III} = T_{no\text{-}miss} + (1 - p_{hit})\, D_{miss}, \qquad (2)$$

where $T_{no\text{-}miss}$ and $D_{miss}$ are, respectively, the average
execution time without misses and the miss delay penalty. Now, if we
replace the KSR1 parameters in eqs. (1) and (2) we get that the
effective subcache miss delay penalty for curve 512K should be
7/8 = 0.875 of $D_{miss}$. The results in Figure 3 for curve 512K
exhibit an effective delay penalty in the range 0.86 to 0.88.
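
The 7/8 figure can be checked with exact arithmetic. The sketch below assumes that a block survives each of the other misses to its set with probability 1 - 1/a, so that the retention probability is (1 - 1/a) raised to the number of other blocks in the set; this form is an assumption made for the illustration, chosen because it reproduces the value quoted above. The function name `effective_miss_fraction` is hypothetical.

```python
from fractions import Fraction


def effective_miss_fraction(a, lines_per_set):
    """Fraction of references that still miss under random replacement."""
    # Assumed model: each of the other (lines_per_set - 1) misses evicts
    # this block with probability 1/a.
    p_stay = Fraction(a - 1, a) ** (lines_per_set - 1)
    return 1 - p_stay


# Subcache: C = 256K, a = 2; curve R = 512K gives R*a/C = 4 blocks per set.
print(effective_miss_fraction(2, 4))   # 7/8
```

The same call also covers the single-set case in which R/S = 4 lines compete for one set.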

A more subtle manifestation of the random replacement policy in
Figures 2 and 3 is the decrease in the effective delay penalty for
the points $S > C/a$ (128K) in curves 1M and higher. In all these
points there are R/S cache lines mapping to a single set, and as S
increases fewer lines are mapped to the set. A similar argument to
that given above applies, except that now the exponent is R/S - 1.
We can see that the second point from the right (identified by
symbol §), which corresponds to stride S = R/4, should also have an
effective delay penalty of 7/8. Figure 3 shows that the miss penalty
drops to the same value as that of curve 512K.

The  previous  discussion  illustrates  how  effective  the  P*  diagrams 
are  in  capturing  the  complex  performance  space  exhibited  by 
shared  memory  machines. 

5.5. KSR1 Single Ring, Two Nodes Performance Results

It is fairly expensive to use data that resides in another node on the
same ring in the KSR1. Figure 4 (see also Figures C-3 and C-4
shown on the color plate page) shows a set of curves for
read-modify-write (as was the case for Figure 2), when the data sets
are as large as or larger than the local cache. The 32 MByte line is
particularly interesting. Not all the data can fit in the local cache at
one time, since some space is needed for the program, and perhaps
for the operating system or other pages for which the executing
node is the home cache. But clearly there is enough reuse of
blocks that even at large strides (8K to 2M), the average cost of the
memory accesses is somewhat less than the cost of large strides for
larger data sets.

Figure 5: Physical and performance profiles of the KSR1 subcache and cache structures obtained using 2 competing nodes running
R-M-W based experiments. Each regime identifies the mean execution time for a particular combination of subcache and cache
miss penalties. For array sizes smaller than 16MB, there is no contention between the nodes.

As we mentioned in section 2, the KSR1 local cache is 32 MBytes,
16-way associative. However, some pages are "wired" by the
operating system, so they cannot be selected as victims. In
addition, the local cache acts as home of some fraction of the user's
pages. Hence, the effective cache size that a program sees is
significantly less than the 32 MBytes and the effective set
associativity is less than 16.

Both of these characteristics are clearly present in Figure 4. For
example, the 32M curve, which in principle should fit in the
cache, clearly shows the presence of page misses with
replacement. If the entire 32 MBytes were available, the curve's
shape should be the same as the 16 MByte curve. The curve reaches
its highest point between strides 16K and 64K and then drops
because the total number of pages touched by a particular
experiment is constant for all strides less than 16K and then
decreases in proportion to the stride.

With respect to the set associativity, the three rightmost points of
all curves having regions greater than 16 MBytes indicate that the
cache associativity is 8-way. We can see this by noting that when
the larger curves have a stride of 1/8 their size (e.g. 4MB stride in
32MB data set), their access time drops back to the subcache block
miss level. This occurs because we are referencing only 8
different local cache pages and they map to a single set in the local
cache. This value is less than the expected 16-way. Because our
experiments change R and S only in powers of two, we detect
8-way associativity, while its real value can be any number between
8 and 16. We have performed more detailed experiments and have
found that the effective associativity varies from set to set in the
range from 3 to 12.
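
The rule used here and in section 5.3 (the ratio R/S at which the access time falls back gives the associativity) amounts to a one-line computation. The sketch below only restates that rule; the function name `apparent_associativity` is hypothetical, and because R and S are varied in powers of two the result is the coarse value the curves reveal, not the exact per-set associativity.

```python
def apparent_associativity(region_bytes, drop_stride_bytes):
    """Ratio R/S at the stride where the access time falls back."""
    return region_bytes // drop_stride_bytes


KB, MB = 1024, 1024 * 1024
print(apparent_associativity(512 * KB, 256 * KB))  # 2 (subcache)
print(apparent_associativity(32 * MB, 4 * MB))     # 8 (local cache)
```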

Finally, from Figure 4, we observe that the cost of removing and
replacing a page in the cache is quite large (about 500 µsec). It
should be borne in mind that one or more blocks in each page
being replaced have been modified (are in the exclusive state), and
must be evicted before a new block of the new page can be
retrieved. This is a form of page swapping or thrashing between
memories in the same ring, and does not require that the pages
being evicted be written to disk.

5.6. KSR1 Two Rings, Two Nodes Performance Results

So far, we have discussed the performance of a single node in
accessing data, though the data may reside on more than one node.
We now turn our attention to the case in which multiple nodes are
writing to the same set of data. Figure 5 (see also Figures C-1 and
C-2 shown on the color plate page) shows an experiment in which
two nodes in the same ring access data. This figure is like Figure
3, but with two additional curves labeled "16M", for a 16 MByte
data set. Two processes on different nodes are simultaneously
Summary of the KSR1 Micro Benchmark Results Using One and Two Nodes

                 Misses
         Subcache         Local cache              Evict    R-M-W cycle iteration time
Regime   subblock  block  subpage  page   Rings?   dirty    Total       Residual    Residual
                                                   page     time        time        miss
------   --------  -----  -------  ----   ------   -----    ---------   ---------   --------
I        no        no     no       no     n.a.     n.a.       0.65 µs   -           baseline
II       yes       no     no       no     n.a.     n.a.       2.50 µs     1.85 µs   subblock
III      yes       yes    no       no     n.a.     n.a.       4.20 µs     1.70 µs   block
IV       yes       no     yes      no     same     n.a.      10.10 µs     7.60 µs   subpage
V        yes       yes    yes      no     same     no        11.80 µs     1.70 µs   block
VI       yes       yes    yes      yes    same     yes      520.00 µs   508.00 µs   page
VII      yes       yes    yes      no     diff     n.a.      28.60 µs    26.10 µs   subpage
VIII     yes       yes    yes      yes    diff     no        31.10 µs     2.50 µs   block

Table 2: Mean execution time for a single read-modify-write iteration. Each regime represents a combination of R and S producing a
particular pattern of misses to a subset of the memory hierarchy. Column "Evict dirty page" shows the delay involved in moving a
dirty page out of a cache after a cache miss. The dirty page has to be sent back to the "home" node.


accessing the data set in read-modify-write mode, for different
strides. In one case, the two nodes are on the same ring, and in the
second case, they are on different rings. The figure shows that the
cost of a miss in the local cache is about 7.6 µsec (at stride 128,
the size of a subpage, every read misses), if the subpage is in
another cache on the same ring. Note that the total cost of the
access is the sum of all the misses, about 10.1 µsec.

The cost of accessing smaller data sets will be the same as shown
in the 16M curve, since the two processors will be invalidating
each other's cache subpages. The experiment was designed so that
the starting point for each processor was separated by 1/2 the size
of the data set (i.e., 8 megabytes apart). In this way, the processors
are not competing for the same data at the same time except at
large strides.

From the curve for the case of two processes located in nodes on
different rings, the subpage miss penalty is 27.8 µsec. The
machine used for the experiment had two ring:0 rings
interconnected by a ring:1 ring. We assume that there were only two
ring interfaces in ring:1. If traversal time in ring:0 is about 7.5
µsec for 33 nodes (including the ring interface), then about 13 µsec
are consumed in the ring interfaces and in traversing ring:1. Since
the per-link data rate in ring:1 is the same as in ring:0 for this
machine, the cost of the directory operations and ring insertions
would appear to be consuming most of this time.
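
The estimate of about 13 µsec can be reproduced with a short calculation. The sketch below assumes that a remote miss involves one traversal of each of the two ring:0 rings, which is an interpretation made for this illustration and not stated explicitly in the text; the function name `ring1_overhead_us` is hypothetical.

```python
def ring1_overhead_us(total_miss_us, ring0_traversal_us, ring0_traversals=2):
    """Time left for the ring interfaces and ring:1 after the ring:0 traversals."""
    return total_miss_us - ring0_traversals * ring0_traversal_us


# 27.8 µsec remote subpage miss, 7.5 µsec per ring:0 traversal (values from the text).
print(round(ring1_overhead_us(27.8, 7.5), 1))   # 12.8
```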

The results displayed in Figures 4 and 5 show eight performance
regimes (indicated by roman numerals in the figures). Each regime
is determined by the number of misses it triggers in a number of
enclosing levels of the hierarchy. Different regimes also identify
the locations from which a miss can be satisfied.

Regime I (Figure 5) represents the baseline time: the case when no
misses are triggered by the micro benchmark. Regime II adds to
the baseline the delay due to subcache subblock misses, while
regime III includes both subblock and block miss delays.
Regimes IV and VII contain the effect of local cache subpage
misses. The first captures the case when the miss is satisfied by a
node in the same ring, while the second represents reading the data
from a remote ring. In all regimes, except VI, the region of data
covered by the micro benchmarks is less than the size of the local
cache. Hence, all subpage misses occurring in these regimes are
only the result of mutual invalidations between the nodes, because
both nodes need exclusive rights over the subpage.


[Figure 6 (diagram not reproduced). Read-modify-write average
execution time, broken into components, with clock cycles in
parentheses: cache subpage miss 7600 ns (152); subcache block miss
1700 ns (34); subcache subblock miss 1850 ns (37); R-M-W baseline
650 ns (13).]

Figure 6: Components of regime V (ring:0 latency).


In regime VI, on the other hand, satisfying a miss requires either
evicting a dirty page if the micro benchmark is based on the R-M-W
cycle or detecting that the page is not dirty and just dropping it
if it is based on the R-U cycle. In the former case the complete
dirty page has to be sent to the "home" node. In both situations
there is a significant extra penalty involved. The results just
discussed are summarized in Table 2.

Figure 6 shows the component times of a memory access in
regime V.
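
As a consistency check, the residual times listed in Table 2 for the regimes enclosed by regime V can be added up. The sketch below is only an illustration; the 50 ns cycle time is an assumption inferred from the baseline of 650 ns being 13 cycles.

```python
CYCLE_NS = 50  # assumption: 650 ns corresponds to 13 clock cycles

# Residual times from Table 2 for the levels that miss in regime V.
components_ns = [
    ("R-M-W baseline", 650),
    ("subcache subblock miss", 1850),
    ("subcache block miss", 1700),
    ("cache subpage miss (same ring)", 7600),
]

total_ns = sum(ns for _, ns in components_ns)
for name, ns in components_ns:
    print(f"{name}: {ns} ns ({ns // CYCLE_NS} cycles)")
print(f"regime V total: {total_ns} ns")   # 11800 ns, i.e. 11.80 µs
```

The sum matches the total time of 11.80 µs reported for regime V.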

6. Communications Performance of the KSR1

The performance of the interconnection network has a significant
effect on the overall performance of parallel computations and
greatly affects the granularity achievable. The experiments
reported in the previous section which measured the performance
of the memory hierarchy were carefully designed to minimize the
effects of loading of the communications network. For real
applications, both memory performance and communications network
performance will affect the overall rate of computation. In this
section, we report on our experiments to investigate the
performance of the KSR1 ring interconnection network.

The experiments are similar to our memory experiments. A single
shared array is accessed by several nodes. The array is divided
into equal portions, with each portion accessed in a
read-modify-write cycle by only two nodes. The placement of the
nodes and the assignment of portions of the array to nodes is
carefully designed so that all nodes execute their portion of the
experiment in about the same amount of time. (This requires care; it
is easy to construct experiments in which the performance of some of
the nodes is much worse than other nodes, even though all are doing
the same amount of work).

Figure 7: Communication latency as a function of the number of nodes communicating.
6.1. Contention on a Single Ring: Analytic Model

Here we present a simple model for the extra delay in latency due
to contention in the ring in the case when all communication is
local to ring:0. We then compare the model against experimental
results. In our experiments, the time per iteration $t_{iter}$ can be
broken into two components: time of computation ($t_{comp}$) and time
for communication ($t_{comm}$). Term $t_{comm}$ has additional
components: the communication time without contention
($t_{no\text{-}cont}$) and the extra delay due to contention
($t_{penalty}$). The latter term is a function of the number of
communicating nodes ($n_{nodes}$). Let $n_{slots}$ be the number of
slots in ring:0. On the average, when a node wants to drop a packet
into the ring, there are $(n_{nodes} - 1)\, t_{no\text{-}cont} / t_{iter}$
other messages occupying slots. The probability that a random slot is
empty is given by

$$p_{empty} = 1 - \frac{(n_{nodes} - 1)\, t_{no\text{-}cont}}{n_{slots}\, t_{iter}}.$$

Now, the number of consecutive occupied slots passing through a
node before an empty slot is found follows the geometric
distribution with parameter $p_{empty}$. Hence, the expected number of
consecutive occupied slots is $(1 - p_{empty})/p_{empty}$ (see
footnote 2). Given that the time between successive slots is
$t_{slot}$, we can compute the expected extra delay in latency due to
contention as

$$t_{penalty}(n_{nodes}) = t_{slot}\, \frac{1 - p_{empty}}{p_{empty}}. \qquad (3)$$

Eq. (3) does not apply to the case when nodes in different rings
communicate. Unfortunately, we do not have enough information
about ring:1 and the interfaces between ring:0 and ring:1 to
produce a realistic analytical model in this case.

Footnote 2: The mean number of slots passing through a node,
including the empty one, is given by
$1/p_{empty} = (1 - p_{empty})/p_{empty} + 1$.
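
The geometric-distribution argument of the contention model can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the empty-slot probability has the form 1 - (n_nodes - 1) * t_in_ring / (n_slots * t_iter), where t_in_ring stands for the time a message occupies a slot, and every numeric value in the usage line is hypothetical, since the number of slots and the slot spacing are not given in this section.

```python
def p_empty(n_nodes, n_slots, t_in_ring, t_iter):
    """Probability that a random ring slot is empty (assumed model form)."""
    occupied = (n_nodes - 1) * t_in_ring / t_iter   # mean messages from other nodes
    p = 1.0 - occupied / n_slots
    if not 0.0 < p <= 1.0:
        raise ValueError("model only applies while some slots stay empty")
    return p


def expected_occupied_slots(p):
    """Mean of a geometric count of occupied slots seen before an empty one."""
    return (1.0 - p) / p


def contention_penalty(n_nodes, n_slots, t_in_ring, t_iter, t_slot):
    """Expected extra latency: occupied slots waited for, times slot spacing."""
    p = p_empty(n_nodes, n_slots, t_in_ring, t_iter)
    return t_slot * expected_occupied_slots(p)


# Hypothetical numbers, chosen only to exercise the functions.
print(contention_penalty(n_nodes=32, n_slots=64, t_in_ring=7.6, t_iter=10.1, t_slot=0.3))
```

With a single node the penalty is zero, and the mean number of slots inspected, including the empty one, is one more than the value returned by `expected_occupied_slots`.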


6.2. Experimental Results

The graphs in Figure 7 show the experimental values of the
communication latency, including the contention penalty
$t_{penalty}(n_{nodes})$, for various numbers of nodes. The graph
labeled "RING-0 LATENCIES" also shows the latency predicted
by eq. (3) in the single ring case.

The ring-0 results in Figure 7 clearly show that in the case of a
single ring, even though the latency tends to increase with the number
of nodes, this increase is relatively modest. In fact the total
increase in latency going from 2 to 32 nodes is less than 15% of
the original value. There is a small error between the analytical
and experimental results which increases with the number of
nodes. The maximum error observed is less than 11% (see
footnote 3). This is because eq. (3) overestimates $t_{penalty}$ by
assuming that $t_{iter}$ is independent of the number of nodes in the
experiment. In actuality the time per iteration is given by

$$t_{iter} = t_{comp} + t_{no\text{-}cont} + t_{penalty}(n_{nodes}).$$

Considering this new term in eq. (3) reduces the error between the
experimental and analytical results to less than approximately 5
percent.

The contention penalty when node communication requires sending
messages through ring:1 shows a more interesting behavior.
Figure 7.b distinctly shows two performance regimes, one for less
than 32 nodes and another for more than 32 nodes. In the former
case increasing the number of nodes by one, on the average,
increases latency by 152 ns (approximately 3 cycles), while in the
latter case, the increment is as large as 1000 ns (approximately 20
cycles). A simple model based on a linear fit of both regimes gives
the following formulas:

$$L(n) = \begin{cases} 152\,\mathrm{ns} \times n + 25200\,\mathrm{ns} & \text{for } n < 32 \\ 1004\,\mathrm{ns} \times (n - 32) + 27500\,\mathrm{ns} & \text{for } n > 32 \end{cases} \qquad (4)$$

with respective correlation coefficients of 0.9713 and 0.9116.
When the number of active nodes increases from two to 32 and 60,
then eq. (4) gives a relative increase of 21% and 110%
respectively in the total latency.

Footnote 3: The 11% error in $t_{penalty}$ represents only a 2% error
in terms of the total communication latency.

From the data presented in Figure 7, we can estimate the data rate
per node under various communications loads. The conflict-free
rate for a node is about 17 megabytes/second (a 128 byte subpage
every 7.6 µsec). As the load gets heavy within a single ring, with
the ratio between requests for data (cache misses) and computation
of our experiments, the rate declines to about 15 megabytes/second
when all 32 nodes are generating requests. When data moves
between rings, rates are much lower. The range is from 5
megabytes/second for one node without contention to about 4
megabytes/second when 32 nodes are active, declining to less than
2.5 megabytes per second per node with a load of 60 nodes.
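
The conflict-free rates quoted above follow from dividing the subpage size by the miss latency. The sketch below is a minimal illustration; the function name `data_rate_mb_per_s` is hypothetical, and the 26.1 µsec remote latency is the residual subpage miss time of regime VII in Table 2.

```python
SUBPAGE_BYTES = 128


def data_rate_mb_per_s(latency_us):
    """One subpage per miss latency; bytes per microsecond equals 10^6 bytes/s."""
    return SUBPAGE_BYTES / latency_us


print(round(data_rate_mb_per_s(7.6), 1))    # 16.8 (same ring)
print(round(data_rate_mb_per_s(26.1), 1))   # 4.9 (different rings)
```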

7. Related Work

Recently several other researchers have been investigating the
performance of the KSR1. Boyd et al [1] show a method of measuring
communications performance on multiprocessors using a synthetic
workload based on matrix multiplication of generated
matrices. Other researchers [10, 9] have also reported on
experiments to measure the performance effects of specific features of
the KSR1. The communication and synchronization performance
of the KSR1 has been analyzed by Dunigan [4]. Singh et al [15]
present performance results of several kernel codes and some of
the SPLASH benchmark suite on the KSR1 and DASH machines.
Finally, an analytic model comparing the potential benchmark
performance of NUMA and COMA machines has been developed by
Hagersten [5].

8. Conclusions

Based on our measurements, it appears that the KSR1
ALLCACHE memory architecture should gracefully extend to a
large number of processors. It indeed fulfills its promise of a
scalable shared memory architecture. As with any parallel machine,
the performance of parallel applications will depend on the degree
and form of interactions between computations running on
separate nodes of the machine. From our results, it is clear that
there are types of shared access that are expensive, and that the
programmer should be aware of the costs of accessing data that
resides on a different node from the accessing node, especially at
larger strides.

The overall performance of any machine is a combination of its
many features. The ALLCACHE memory is only one component
of the KSR1. In addition to the processor, the ring of rings
interconnect is a major element. The architecture is interesting and
effective. Our results do not permit us to give a relative evaluation
of the machine in comparison with other architectures, but our data
will, we believe, help potential users in deciding whether to use
the machine, and how to use it effectively.
Fig. C-1: KSR1 2-Node Physical and Performance Profile (Fig. 5).
The projection is taken from point {-2.5, -1.7, +0.5}.

Fig. C-2: KSR1 2-Node Physical and Performance Profile (Fig. 5).
The projection is taken from point {-2.5, +1.7, +0.5}.

Fig. C-3: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The projection is taken from point {-2.5, -1.7, +0.5}.

Fig. C-4: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The projection is taken from point {-2.5, +1.7, +0.5}.

Fig. C-5: KSR1 2-Node Physical and Performance Profile (Fig. 5).
Cache misses satisfied in local ring:0 are shown.

Fig. C-6: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The effect of misses with page replacement is shown.

